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Abstract 

A long-running difficulty with conventional game theory has been how 
to modify it to accommodate the bounded rationality characterizing all 
real-world players. A recurring issue in statistical physics is how best to 
approximate joint probability distributions with decoupled (and therefore 
far more tractable) distributions. It has recently been shown that the 
same information theoretic mathematical structure, known as Probability 
Collectives (PC) underlies both issues. This relationship between statis- 
tical physics and game theory allows techniques and insights from the 
one field to be applied to the other. In particular, PC provides a formal 
model- independent definition of the degree of rationality of a player and 
of bounded rationality equilibria This pair of papers extends previous 
work on PC by introducing new computational approaches to effectively 
find bounded rationality equilibria of common-interest (team) games. 


1 INTRODUCTION 

The fields of statistical physics, game theory, and distributed control/optimization 
share one fundamental characteristic: they are all concerned with how the prob- 
ability distribution governing a distributed system is related to the functionals 
that it optimizes. This shared characteristic provides the basis for a mathe- 
matical language for translating many of the concepts of those fields into one 
another. This mathematical language is known as Probability Collectives (PC) 
[1, 2, 3, 4, 5, 6]. By allowing us to transfer theory and techniques between those 
fields, it provides a means of unifying them. 

This pair of papers introduces computational techniques from PC for ef- 
ficiently finding bounded rational equilibria of noncooperative games. In this 
paper first we review PC and how to use it to formalize bounded rationality. We 
then review two of the previously explored techniques for finding bounded ratio- 
nal equilibria, Brouwer updating and Nearest Newton updating. After this we 
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introduce iterative focusing, a new set of techniques for finding full rationality 
equilibria. 

Due to space limitations, several other schemes for finding bounded rational 
equilibria could not be presented in this first paper. They are instead introduced 
in the second paper [7]. That second paper also shows how to extend all of the 
approaches for finding equilibria (from both papers) to the case of uncountable 
move spaces of the players. Some issues that arise in practice when running 
these algorithms are also discussed there. 

The version of Probability Collectives considered in this paper, involving 
product distributions, is called “Product Distribution” (PD) theory [1]. It’s 
important to note that PD theory also ha s many applications in science beyond 
those considered in this paper. For example, see [3, 4, 8, 9, 10, 5, 6, 11] for 
work concerning distributed control and to distributed optimization. See also 
[12, 13, 10] for work showing, respectively, how to use PD theory to improve 
Metropolis-Hastings sampling, how to relate it to the mechanism design work 
in [14, 15, 16, 17], and how to extend it to continuous move spaces and time- 
extended strategies. 

Throughout these papers 5 functions are either Dirac or Kronecker as ap- 
propriate, integrals implicitly have a measure appropriate to the cardinality of 
the underlying space, and © is the Heaviside step function. 

2 Review of Probability Collectives 

2.1 Game theory and Statistical Physics 

In noncooperative game theory one has a set of N players, each choosing its 
(normal form) strategy Xi independently, by sampling a distribution qi(xi) over 
those strategies. Each player i also has her own cost function gi(x), specifying 
how much punishment she gets for every possible joint-strategy x of all N play- 
ers. 1 Let q-i(x-i) mean the joint probability distribution of all players other 
than z, i.e., Ylj^i Qj ( x j)- Then the “goal” of each player i is to set qt so that, 
when conditioned on <?_*, the expected value of z’s cost is as low as possible. 

Conventional noncooperative ngame theory assumes each player i is “fully 
rational”, i.e., able to solve for that optimal qi , and that she uses that optimal 
distribution. It is primarily concerned with analyzing such equilibria of the 
game [18, 19, 20, 21]. In the real world, this assumption of full rationality 
almost never holds, whether the players are humans, animals, or computational 
agents [22, 23, 24, 25, 26, 27, 28, 29, 30]. This is due to the cost of computation 
of that optimal distribution, if nothing else. This real-world bounded rationality 
is one of the major impediments to applying conventional game theory in the 
real world. 

1 A “cost” function is just the negative of a utility function. We work with costs rather 
than utilities to agree with usage in the optimization and physics communities, where the goal 
is minimization of objective functions rather than maximization. 



Perhaps the most succinct and principled way of deriving statistical physics 
is as an application of the Maximum Entropy (Maxent) principle of informa- 
tion theory [31, 32, 33]. In this formulation, the problem of statistical physics 
is cast as how best to infer the probability distribution over a system’s states 
when one’s prior knowledge consists purely of the expectation values of certain 
functions of the system’s state [34, 33]. For example, this prescription says 
we should infer that the probability distribution p governing the system is the 
Boltzmann distribution when our prior knowledge is the system’s expected en- 
ergy. This scenario known as the “canonical ensemble”; other ensembles arise 
when other expectation values are added to one’s prior knowledge. In particu- 
lar, if the number of particles of various types is uncertain, but one knows their 
expectation values, one arrives at the “grand canonical ensemble”. 

It has recently been recognized that this exact approach used in statistical 
physics provides a principled way to modify conventional game theory to ac- 
commodate bounded rationality [2], Doing so results in the principle that the 
bounded rational equilibrium is the minimizer of a certain set of coupled La- 
grangian functions of the joint distribution, q(x) = rii&O 2 ^) ( see below). The 
resultant distributions are the same ones that arise in the canonical ensemble 
of statistical physics. In particular, the Nash equilibrium of a game simply cor- 
responds to the special case in statistical physics where the temperature equals 
0. Bounded rationality corresponds to non-zero temperature. 

In addition to showing how to formulate bounded rationality, PC provides 
many other benefits to game theory. For example, its formulation of bounded 
rationality explicitly includes a term that, in light of information theory, is 
naturally interpreted as a cost of computation. So from the point of view of 
PC, it is formally correct to say that bounded rationality arises due to the cost 
of computing the Nash equilibrium. As another example, one can apply the 
grand canonical ensemble to game theory rather than the canonical ensemble. 
This extends game theory to allow for stochastic numbers of players of various 
types [1] in addition to bounded rationality of those players. Among other 
things, this results in some novel extensions of evolutionary game theory. 2 As a 
final example, in [2] some algorithms for finding bounded rational equilibria are 
derived that are related to Stackelberg games. More generally, those algorithms 
are related to the problem of finding the optimal control hierarchy for team of 
players with a common goal, i.e., finding an optimal organization chart. 


2.2 The Maxent Lagrangian 

In a nutshell, the perspective adopted by the PC approach to game theory 
is purely empiricist — you are a scientist, with limited information about a 
particular game, and must predict what strategies are being followed in the 
game. The prediction you make is the associated equilibrium concept. 

2 It’s intersting to note that the replicator dynamics of evolutionary game theory also arises 
in PC for games with a fixed number of players. It turns out that it gives the infinitesimal 
time limit of certain 2nd order descent schemes for finding bounded rational equilibria of such 
games [13]. 


Making such a prediction is an exercise in statistical inference, an old and 
well-understood field. No modeling of the players and their thought processes 
need arise in such inference. Of course, if we are provided hard data (!) concern- 
ing how the players behave, those should be taken into account in the inference. 
In addition, models concerning the behaviors of players can be used, if desired, 
for example via Bayesian inference. Note that any priors arising in that infer- 
ence are set by us , the scientists predicting the game, not by the players involved 
in the game. In contrast, if we believe the players to be Bayesians, that simply 
means that our inference algorithm must involve probability distributions over 
the possible priors of the players. 

To formalize this, for simplicity assume a a finite number of possible moves 
(pure strategies) \Xi\ for each player i. Also for simplicity, consider a common 
payoff (aka team) game, in which all players share the same cost function, G(x). 
(The more general case is addressed in [2].) Say that we are told the expected 
value of G. So our prior knowledge is that the players are independent, and 
that their expected cost is some e. The Maxent principle of information theory 
tells us that the optimal estimate of the q for that prior knowledge is given by 
the minimizer of the Maxent Lagrangian 

J?(q) = (3{E q {G) — e] — S(q) 

= P[ f dx - e] - S(q) (1) 

3 

where 0 is the Lagrange parameter implicitly set by the constraint on the ex- 
pected cost [1, 2, 9] 3 . 

Solving, we find that the mixed strategies minimizing the Maxent Lagrangian 
are related to each other via the coupled Boltzmann distributions 

qi (xi) <xe~ E *-< (GM . (2) 

Following Nash, we can use Brouwer’s fixed point theorem to establish that for 
any fixed value /?, there must exist at least one product distribution q solving 
these equations. 

The first term in -S? is minimized by perfectly rational players. The second 
term is minimized by perfectly irrational players, i.e., by uniform mixed strate- 
gies q % . So 0 in the maxent Lagrangian explicitly specifies the balance between 
the rational and irrational behavior of the player. In particular, for 0 — > oo, by 
minimizing the Lagrangian we recover the Nash equilibria of the game: in that 
limit the set of q that minimize the Lagrangian is the same as the set of delta 
functions about the Nash equilibria of the game. 

See [2] for discussion of how the Maxent Lagrangian encompasses cost of 
computation and its extension to non-common- payoff games. Also discussed 
there is the extension to other kinds of prior knowledge, to other kinds of sta- 
tistical inference (e.g., Bayesian inference), and to variable numbers of players. 

3 Throughout this paper the terms in any Lagrangian that restrict distributions to sum to 
one are implicit. The other constraint, that none of its components are negative, will not need 
to be explicitly enforced. 



Also in that work is a discussion relating the Maxent Lagrangian and its deriva- 
tion to other work in the game theory community. See [35] for a discussion of 
the relationship with other techniques in the optimization community. 


2.3 Descent of the Maxent Lagrangian 

Say we wish to find a bounded rational equilibrium, i.e., a minimizer of the 
Maxent Lagrangian, and that to do this we are iteratively evolving q to minimize 
J if for some fixed /?. Say at the current stage of this process we are at some 
point q 6 Q. It is proven in [9] that the direction from q within Q that, to first 
order, results in the largest drop in the value of J£f(<?), is as follows: 


d R 5?{q) 
d R qi{xi = j) 


«i(j) -£«*&)/ l*il, 


(3) 


where Ui(j ) = /3E(G \ X{ = j) + lnfe(j)], and the symbol d R indicates that we 
do not mean the indicated partial derivative, formally speaking, but rather the 
indicated component of the lst-order descent vector 4 . 

Eq. 3 gives the (negative of the) change that each agent should make to its 
distribution to have them jointly implement a step in steepest descent of the 
Maxent Lagrangian. These updates are completely distributed, in the sense 
that each agent’s update at time t is independent of any other agents’ update 
at that time. Typically at any t each agent i knows q t (t) exactly, and therefore 
knows ln[^(j)3- However often it will not know G and/or the q-i. In such cases 
it will not be able to evaluate the E(G | x x = j) terms in Eq. 3 in closed form. 

One way to circumvent this problem is to have those expectation values be 
simultaneously estimated by all agents by repeated Monte Carlo sampling of q 
to produce a set of (r, G(x)) pairs. Those pairs can then be used by each agent 
i to estimate the values E(G | Xi — j), and therefore how it should update its 
distribution. In the simplest version of such an update to q only occurs once 
every L time-steps. Note that only one set of Monte Carlo samples is needed 
for all players to determine how to update their mixed strategy, no matter how 
many players there are. 

In this simple Monte Carlo scheme only the samples (r, G(x)) formed within 
a block of L successive time-steps are used at the end of that block by the 
agents to update their distributions (according to Eq. 3). More sophisticated 
approaches re-use old samples, modify the G values returned by the Monte Carlo 
on a player-by-player basis, etc. [36, 1, 7, 10]. 


2.4 Brouwer updating 

There are many other descent schemes in addition to ones related to steepest 
descent. In Brouwer updating for the team game, an agent i adopts the distri- 

4 Formally speaking, the partial derivative is given by Ui(j). Intuitively, the reason for 
subtracting J2 X ' u i( x i)I\Xi\ is to keep the distribution in the set of all possible probability 
distributions over x, V . 


bution qi given by Eq. 2 using the current g_ t -. This is the update that would 
minimize the Maxent Lagrangian if all other agents did not change their distri- 
butions. Since this update rule only depends on E(G | #*), ^ can estimated 
using Monte Carlo samples in the usual way. 

In serial Brouwer updating, only one agent at a time performs the up- 
date. The order in which the agents update their distributions can be pre- fixed 
or random. It can also be dynamically determined in a greedy manner, by 
choosing which agent to update based on what associated drop in the Maxent 
Lagrangian would ensue [1, 13, 5]. (These various ordering schemes are simi- 
lar to the those used in the majorization and block relaxation techniques of in 
optimization statistics.) Aside from Monte Carlo estimation error, each step of 
serial Brouwer updating is guaranteed not to increase the associated (Maxent) 
Lagrangian. Note that with serial Brouwer updating, the agents are in effect 
playing a sequence of Stackelberg games[19, 18]. 

In parallel Brouwer updating, this procedure is followed simultaneously by 
all agents. Accordingly, parallel Brouwer updating can be viewed as a variant 
of ficticious play (see [37, 38, 39]). In mixed serial-Brouwer updating, in a 
given iteration one performs a sequence of more than one Brouwer update, each 
update being applied in parallel to a subset of all the agents. Such updating 
can be viewed as a management hierarchy in a human organization, in which 
at each iteration a subset of the people in the organization make their decisions 
in parallel with one another, and are then followed in their decision-making by 
other such subsets of people. 

Now in general, when any qj changes, for every i ^ j, what distribution 
Qi minimizes i's Lagrangian will change, in accord with Eq. 2. This suggests 
that a step of parallel Brouwer updating may “thrash”, and have each agent 
change in a way that confounds the other agents’ changes. (See [37, 39] and 
references therein for analysis of related issues in ficticious play.) In such a 
case the update may not actually decrease the associated (Maxent) Lagrangian, 
unlike with serial Brouwer. 

There are many possible ways of mixing parallel and serial Brouwer updat- 
ing, in which only subsets of the agents perform parallel updates at any given 
time. These can be viewed as management hierarchies, akin to those in hu- 
man organizations. Often such hierarchies can also be determined dynamically, 
during the updating process. 

2.5 Kullback-Leibler distance 

Say that we were not restricting ourselves to product distributions. So the 
Lagrangian becomes J£?(p) = /3(E P (G) — 7 ) — S(p) r where p can now be any 
distribution over x. There is only one local minimum over p of this Lagrangian, 
the canonical ensemble: 

pP(x) oc e~^ G ( x ) 

In general is not a product distribution. However we can ask what product 
distribution is closest to it. Now in general, the proper way to approximate a 


target distribution p with a distribution from a subset C of the set of all 
distributions is to first specify a misfit measure saying how well each member 
of C approximates p, and then solve for the member with the smallest misfit. 
This is just as true when C is the set of all product distributions as when it is 
any other set. 

How best to measure distances between probability distributions is a topic 
of ongoing controversy and research [40]. The most common way to do so is 
with the infini te limit log likelihood of data being generated by one distribution 
but misattributed to have come from the other. This is know as the Kullback- 
Leibler distance [31, 41, 32]: 

KL(p 1 \\p 2 ) = S(p 1 \\p 2 )-S{p 1 ) (4) 

where S(pi || P 2 ) = — J dx pi(x) ln[^^] is known as the cross entropy from 
Pi to P 2 (and as usual we implicitly choose uniform p). The KL distance is 
always non-negative, and equals zero iff its two arguments are identical. 

As shorthand, define the “pq distance” as KL(p || q), and the “qp distance” 
as KL{q || p), where p is our target distribution and q is a product distribution. 
Then it is straightforward to show that the qp distance from q to target distri- 
bution p 3 is just the Maxent Lagrangian _S?(g), up to irrelevant overall additive 
constants. In other words, the q minimizing the Maxent Lagrangian is the same 
as the q having minimal qp distance to the associated canonical ensemble. 

However the qp distance is the (infinite limit of the negative log of) the 
likelihood that distribution p would attribute to data generated by distribution 
q . It can be argued that a better measure of how well q approximates p would 
be based on the likelihood that q attributes to data generated by p. This is the 
pq distance; it gives a different Lagrangian from that of Eq. 1. 

Evaluating, up to an overall additive constant (of the canonical distribution’s 
entropy), the pq distance is 

KL(jpP || q) = -]T) J 

This is equivalent to a game where each coordinate i has the “Lagrangian” 

&*{q) = ~ J dx t pf (Xi)\n[q t (x t )}, (5) 

where pf(x;) is the marginal distribution f dx_ip^(x). Since q is a product 
distribution, the minimizer of this is just qi = pf Vi, i.e., each qi is set to the 
associated marginal distribution of ]A 

2.6 Adaptive importance sampling and Nearest Newton 

This subsection shows how to set q to minimize pq distance. It also shows how 
pq distance provides a tool to perform second order descent over qp distance. 



The most straightforward way to estimate the marginal distribution mini- 
mizing distance to pP is via adaptive importance sampling, where the proposal 
distribution is an earlier version of q itself. In other words, first we write q in 
terms of the an earlier estimate q as follows: 

qi(xi) oc 


To estimate the expectation value we repeatedly IID sample q , getting a set 
of x values. For each such x we evaluate e~^ G ^ /q{x). For all of Vs possible 
moves, a, we then set our estimate of qi(xi = a) to the average of the subset 
of those sample values of e~/ 3G ( x ) fq{x) for which X{ — a. Variants of this 
scheme replace the Boltzmann function e _ ^ G ( x ) with different function that is 
also biased toward low G values, e.g., Q[K— G(x)]. See the discussion of iterative 
focusing below. 

Note that the optimal solution for qi for the pq KL Lagrangian — pf — is 
independent of the This is also true for the variants of Monte Carlo sampling 

with Brouwer updating that are discussed in [7]. However this differs from the 
case for the optimal solution for the qp KL Lagrangian, given in Eq. 2. However 
in both the adaptive importance sampling scheme for the pq KL Lagrangian 
and steepest descent for the qp KL Lagrangian, the update rule for qi depends 
on the distributions via the Monte Carlo sampling. See [1]. 

One potential difficulty with using adaptive importance sampling this way 
is that it could lead to big jumps in Q, with attendant instability arising from 
estimation error. An alternative approach for minimizing pq distance is to only 
use the adaptive importance sampling to provide the information needed to 
perform steepest descent on the Lagrangians Ji?*(q). More precisely, the steepest 
descent direction of that Lagranian for agent i is given by Eq. 3, only now with 
Ui(j) = —Pi{j)/qi(j)- The marginal pf(xi) in this direction can be estimated 
via adaptive importance sampling, just as above. So with such gradient descent 
We are still using adaptive importance sampling to estimate the marginal of p^. 
However now the algorithm only take a small step in a direction determined by 
that estimate of pf . 

Care must be taken when going to second order descent methods. The 
Hessian of Jt?*(q) is diagonal and therefore trivial to invert. It is also positive- 
definite. However applying a Newton-Raphson step with that Hessian, for ex- 
ample, would result in a new q that doesn’t lie on Q. Moreover, the update 
step is independent of p @ . Such problems are even more accute when trying 
to descend the Maxent Lagrangian, since coupling between the agents in the 


i) 


r Q-pGixux'-i) 

/ 

ft(l|) /* ,[T Iir u *(*.) 


e -PG(x') 

dx ' [ ~wr ] ^ x '^ Xi ~ 

e -0G(*’) q(x)5(xi - x'j). 


Qi(Xi) Eq{~ 


-0G 


Xi). 


( 6 ) 


multilinear term E(G) makes the associated Hessian non-diagonal 

An alternative approach for second order descent of the Maxent Lagrangian 
starts by making a quadratic approximation (over the space of all p , not just 
all q ) to the Lagrangian, j£?(p) based on the current point p l , Via Newton’s 
method this specifies a p t+l that minimizes that quadratic approximation. We 
can then find the product distribution that is nearest (in pq KL distance) to 
p t+1 and move to that product distribution. The resultant update rule for the 
Maxent Lagrangian is called Nearest Newton descent [9]: 

^^ = 1 - Sfoa-lnfoto)) 

- p[E qt (G\x i =j)-E qt (G)} (7) 

where q l is the current (assumed to be product) distribution. The conditional 
expectations in Nearest Newton are the same as those in gradient descent. Ac- 
cordingly, they too can be estimated via Monte Carlo sampling, if need be. 

In [13] it is shown that in the continuum time limit, Nearest Newton updat- 
ing for modifying the probability distribution of any particular agent becomes 
a variant of replicator dynamics (with the different strategies of replicator dy- 
namics identified with the different possible moves of the agent performing the 
Nearest Newton update). That paper also shows that that when the terms in 
that continuum limit version of Nearest Newton are Monte-Carlo estimated, 
Nearest Newton becomes a version of ficticious play. More precisely, it becomes 
identical to a “data-aged” continuum time limit of parallel Brouwer updating. 

The stepsize of the Nearest Newton procedure is identical to the constant in the 
exponent of the data-aging. 

3 Iterative Focusing for pq distance 

In common payoff games the ultimate goal is finding argmin x G(r), i.e., aigmm q E(G). 
More generally, say we have a mechanism design scenario where x is the vector 
of variables that the designer can set, and the average responses of the human 
beings to that vector are implicit, being buried (along with the social welfare 
function) in G. Then the goal of the designer is to find the x minimizing G 
[42]. This section introduces a class of schemes for minimizing E q (G), without 
direct concern for the Maxent Lagrangian. 

Given some current distribution, often one can generate distributions that 
will be more peaked about low G(x) than that current distribution by appropri- 
ately transforming it. This suggests a ratchet-like process that iterates a two- 
step procedure: First one applies such a transformation to the current product 
distribution, generating a new distributions that in general need not be a prod- 
uct. Then one finds the q that best approximates that transformed version, to 
get the product distribution for the next iteration. More concretely, the process 
of iterative focusing proceeds as follows: 


1) Based on G, choose an associated focusing operator M : V — * V such that 
Em_( p ){G) < E P (G) for any distribution p. In general M(p) Q. So we aksi 
choose a distance measure D(., .) across V, to that we will use to get back to Q. 

2) Given a product distribution q , solve for argmin <?€ QP(g, At (^)). (In general, 
the finding of that minimizing q may be an extensive multi-step process in its 
own right.) 

3) Set the new q to that solution from (2). Potentially “anneal” M as well, to 
tighten the focusing around x with lower values of G(x). Return to (2). 

We would like to be able to guarantee that in each pass iterative focusing 
produces a distribution that is closer to a delta function about argmin x G(x). If 
M(q) were G 2, so that V were unnecessary, we would have such a guarantee. 
However when this is not the case, the distribution output in step (3) ^ M(q), 
and it may not be superior to the distribution input to step (2). Ultimately 
whether the distribution produced by step (3) is superior to the one input to 
step (2) depends on the relation of D, Q, the q input to step (2), and A4(.). 

The simplest choice of focusing operator is multiplication by an a nowhere- 
negative focusing function Fq(x) that increases as G(x) decreases, followed by 
renormalization : 


Proposition 1: Say Fq is integrable and nowhere- negative and G(x ') < G(x) => 
Fg(x') > F g (x) Vz, x'. Then Fq is a single-valued function of G(x), and in addi- 
tion the focusing property of step (1) is guaranteed for M(p) (x) = j ‘ 


Proof: Write Fq{x) ~ f(G(x)) and let y indicate a generic value of G(x). Then 
the post-focusing distribution over y is 


P(y) 


J dxS(G(x) - y) 


/(G(x))p(s) = f(y)p(y) 

J dx f{G{x))p{x) f dy f(y)p(y) 


where p(y) — f dx p(x)6(y — G(x)). Define the distribution r(y) = j dy's^y)^) • 
Then 


r(y) > 1 
p{y) ~ 
iM < i 

P(y) ~ 


Vy' < y 

vy > y 


rW) 
’ p(y') 
r_ W) 

’ p(y') 


> i, 

< i. 


Therefore there must be a greatest value a such that Vy < a, > 1, and a 
smallest value b such that Vy > 6, < 1. Now by normalization, 



dy[r(y) - p(y)] = ~ j 


dy[r{y) -p(y)}. 



and r(y) — p(y) throughout the range (a, 6). Write the change in expected G 
when multiplying p(x) by Fg(x) and renormalizing as 


A (G) 



dy y[r(y) - p(y )] 
dy y[r{y) - p(y)] + 



dy y[r{y) - p(y)]. 


The first integral is bounded above by a dy[r(y) —p(y)], and the second is 
bounded above by b / b °° dy[r(y) — p(y)]. Combining gives A (G) < 0. QED. 


Without loss of generality, we can take Fq(x) to be a probability distribution. 
As any example, given any distribution p, the focused distribution 


Me(K-a{ 0)(p)(*) 


p(x)Q[K — G(x)] 

J dx f p(x / )Q[K — G(x')] 


is guaranteed to be more peaked about x with small G(x) than is p. So we 
can minimize G(x) by iterating the process of finding the product distribution 
q that best approximates -Me{K-G(.))iP ) and then setting p = q. In doing this 
we can also gradually decrease K , to get distributions that are more and more 
restricted to argmin x G(x). This “annealing” is analogous to the an iteration of 
the process of finding the q that best approximates the Boltzmann distribution 
for a particular 0 and then increasing 0. 

It would be nice to choose a distance measure V to minimize the chances that 
the projection back into Q needed in iterative focusing thwarts the improvement 
given by applying the focusing operator M. However it’s not clear how best to 
do that. So as an alternative, here we focus on the simplest choices of distance 
measure, the pq and qp KL distances. 

Continuing with our example, for the choice ofpg-KL distance as the distance 
measure V , the q that best approximates is just the product of 

the marginal distributions of M^K-G{.))ip)’ So in each pass of the associated 
iterative focusing algorithm, 


f Xi)Q[K — G(xi,x'_ i )] 

Qi[Xi) J dx' q(x')G[K - G(x')} 

q(G < K,Xi) 
q(G < K ) 

= q(xi\G<K) (8) 

oc q(G < K \ Xi)qi(xi) ' (9) 

This new & can be Monte-Carlo estimated by agent i using only observed G 
values, in the usual way. So like gradient descent on the Maxent Lagrangian, 
this update rule is well-suited to a distributed implementation. Indeed, the only 
term that needs Monte Carlo estimating is q(G < K | £»). 



3,1 Iterative focusing and Brouwer updating 

One obvious potential problem with this updating rule is that the randomness of 
the Monte Carlo process might erroneously lead one to set qi(xi = a) to 0 even 
though the x that minimizes G has its V th component equal to a. Once this 
happens recovery is impossible; there is no way for qi(xi = a) to subsequently 
increase from 0. 

We can avoid this problem if we replace the Heaviside focusing function 
with a “softened version” like a logistic function with exponent /? about K , 
©/?,/c( x ) = [1 + . With this change the update rule becomes 


Zifa) 


E(Q PiK | x^q^Xj) 
E(Qj3,K ) 


( 10 ) 


As another alternative, we can replace the Heaviside function with a Bolz- 
mann distribution with exponent /3, getting the update rule 


«»(*») 


E(e 0a | Xi)qi(xi) 


( 11 ) 


where all terms on the righthand side are evaluated under the distribution that 
generated the Monte Carlo samples. 

Unlike the update rule of Eq. 9 or Eq, 10, with Eq. 11 we don’t need to 
specify an annealing schedule under which the focusing function changes as 
the algorithm progresses. The annealing is automatic, due to the fact that 
multiplying by e~^ G ^ at each iteration is the same as increasing /3 by the same 
constant at each iteration, i.e., due to the fact that the Boltzmann distribution 
is a Lie group with parameter /?. 

The update rule of Eq. 11 is similar to that of Eq. 6, only here to form the 
quantity being averaged one does not divide by q(x ). It’s also very similar to 
the Brouwer update rule applied to the team game [5, 1, 43]. However in 
contrast to the potential “thrashing” of parallel Brouwer updating, the update 
in Eq. 1 1 is guaranteed to minimize its associated Lagrangian of pq distance to 
M. e(K-G(.)){q ) (assuming no error in estimating the expectation values). On 
the other hand, in Eq. 11 the Lagrangian is not distance to a fixed distribution, 
as it is with the Maxent Lagrangian of parallel Brouwer updating and with 
the update of Eq. 6. Rather the “target distribution” itself changes from one 
iteration to the next. This is actually the case for all iterative focusing, and is 
the analogue of the thrashing of parallel Brouwer updating. 


3.2 Iterative focusing for qp distance 

Rather than pq KL distance, consider using qp KL distance as the error measure 
for approximating the Boltzmann-focused distribution, q(x) j — e ^ x ^ e -^GWy 
Up to an overall additive constant, that distance is just the Maxent Lagrangian, 
with one difference: we replace the 5(g) term with —KL(q || g). This suggests 
an iterative algorithm which starts with a distribution g, and then searches for 



the g minimizing a modified version of the Maxent Lagrangian Jzf(g) in which 
the 5(g) term is replaced by — KL(q || g). When that search terminates one 
resets q to that minimizing q and starts all over again. 

The natural choice for the initial q is the uniform distribution. In this case 
—KL(q || q) = 5(g), so our KL distance for the first pass of the algorithm 
reduces to the usual form of the Maxent Lagrangian. Therefore the first stage 
of the algorithm proceeds to a local minimum q 1 of the Maxent Lagrangian (via 
Nearest Newton, Brouwer updating, or some such). Once that local minimum 
is found, there are many schemes to break out of it (see [7]). One of the most 
obvious is to simply raise (!) (3 and restart the descent. Such a change to /? is 
equivalent to modifying our current target distribution, p = q j dx , ? 

by multiplying it by another Boltzmann distribution (and then renormalizing). 

The most natural choice for when to change q to q 1 is after one has gone past 
all such local minima to a global minimum (or more generally, to what one hopes 
is a global minimum). However there are many alternatives. For example, one 
could do the replacement at the first local minimum. As opposed to multiplying 
the current target p by a Boltzmann distribution, such a replacement of q would 
make a new target distribution by multiplying g 1 by a Boltzmann distribution. 
Obvious variants of this scheme weave the resetting of the target p more fre- 
quently into the overall process, so that it is updated before a local minimum 
is found. Indeed, one can even have the target reset after every modification of 
g, to be the preceding q. 

As an example of the foregoing, given the current product distribution g, 
the optimal solution Eq. 2 changes to 

B, ^ , -PE qf , (G\ Xi ) 

gf(xO oc qi(xi)e 

So Brouwer updating is now different from what it is in the conventional Maxent 
Lagrangian case, with the distribution q serving as a prior probability p (see 
Eq. 1). If at each 1 4- 1 we update g* 44 to q l , and (as in conventional Brouwer) 
use q l to estimate the expectation value, the update rule is 

g t+1 (xi) oc q\(xi)e~^ E ^ G)[Xi \ (12) 

However just as the parallel, serial, etc., variants of Brouwer updating all 
have their strengths, so there axe reasonable schemes for how to set the updating 
of the distributions on the righthand side of Eq. 12. For example, one might 
replace the g|(x t ) term on the righthand side of Eq. 12 with q\ (xi) for some 
earlier t f . Another possibility is to replace the q l in the exponential on the 
righthand side with q 1 " for some t n < t 3 and more generally one can vary which 
of these replacements gets made when. 

It is interesting to compare the variant of parallel Brouwer given by Eq. 12 
with Nearest Newton applied to the Maxent Lagrangian. If we expand the 
exponential in Eq. 12 to first order we get 

9i +1 (^i) OC 9i( x t)[l — &Eq*{G I ®i)]. 


(13) 


If we now approximate the update of Nearest Newton by removing the In term 
(i.e., by taking temperature to 0, with stepsize changed accordingly), we get an 
update rule almost identical to that of Eq. 13. (The remaining difference is that 
Nearest Newton normalizes the update to stay in V by adding a normalizing 
vector rather than by dividing by a normalizing scalar.) This connection is not 
too surprising, in light of the fact that in the continuum time limit with data- 
aging, Nearest Newton and parallel Brouwer updating become identical, with 
the stepsize of the Nearest Newton identically equal to the data aging-constant 
in the parallel Brouwer [13]. 

In addition to Brouwer updating, gradient descent also changes when we use 
it for iterative focusing of q using qp distance rather than for minimization of 
the Maxent Lagrangian. The term U{ (j ) of Eq. 3 that sets the descent direction 
becomes 

/3E(G|^=j)+ln[|g|]. (14) 

So by iteratively focusing q rather than descending the Maxent Lagrangian we 
penalize q’s that differ from q . If q is updated frequently, this provides an inertia 
effect in the dynamics of q, impeding it from changing too fast. 5 

Nearest Newton also changes when used for iterative focusing rather than 
descending the Maxent Lagrangian. It becomes 


<?* +1 co , 
m 


+ KLiqtm-qXjM^} 

- 0[E q t(G | Xi = j) - E q t(G)]. 


(15) 


We can similarly implement gradient descent, Nearest Newton, or Brouwer up- 
dating for iterative focusing of qp KL distance for other focusing functions be- 
sides the Bolzmann function. For example, for the logistic function as the 
focusing function, we get the same formulas as above, just with G(x) replaced 


throughout by 


lnp+e^ G ^>-*>j 

p 


4 Summary of update rules 

In conventional optimization over Euclidean spaces, one can use many different 
algorithms, including gradient descent, conjugate gradient, Newton’s method, 
quasi-Newton, simulated annealing, genetic algorithms, etc. As proven in [44], 
it is theoretically impossible for any one of these problems to be superior to 
the others overall. Rather which algorithm one should use depends on the 
nature of the function being optimized, the expense of evaluating various kinds 
of information during the optimization, etc. 

Similarly, in the preceding discussion many different PC update rules were 
presented, with the ultimate choice of which rule to use depending on the nature 

5 In the limit where q is always set to the q just before the current one, our gradient descent 
becomes a second order dynamical equation. Iterative focusing based on several previous 
distributions, not just the single distribution q , leads to higher-order dynamics. 



of the function being optimized, the expense of evaluating various kinds of 
information during the optimization, etc* This section compares some of those 
update rules. 


4.1 Multiplicative updating 

All the update rules described above can be written as multiplicative updating. 
The following is a list of the update ratios r q ^{xi) = q t i ^ 1 (xi)/ql(x i ) of some of 
those rules. In all of these Fq is a probability distribution over x that never 
increases between two x’s if G does (e.g., a Boltzmann distribution in G(x)). In 
addition const is always a scalar that ensures the new distribution is properly 
normalized and a is a stepsize. 6 

Gradient descent of qp distance to Fg- 

i r E q t (ln[F G ] | x t ) + In (ql(xi)) , 

1 “ a| m 1_ 

Nearest Newton descent of qp distance to Fq- 

1 - a[£?,.(ln[Jb] | Xi) + ln(9i(*i))] ~ 


const 

Qi( x i) 


const 


(16) 


(17) 


Brouwer updating for qp distance to Fq : 


const x 


e E qt (]li[F G ] N) 


Zifa) 


(18) 


Importance sampling minimization of pq distance to Fg(x): 

Fg 

Const X E q t(— \ Xi) 


(19) 


Iterative focusing of q with focusing function Fq(x) using qp distance 
and gradient descent: 


1 — a{ 


E qt (]n[F G ] 1 Xj) -j 

q'ixi) 


Hfig] 


}- 


const 

q^Xi) 


( 20 ) 


Iterative focusing of q with focusing function Fc(x) using qp distance 
and Nearest Newton: 

1 - a{E q < (1 n[F G ] \ x { ) + - const (21) 

6 As a practical matter, both Nearest Newton and gradient-based updating have to be 
modified in a particular step if their step size is large enough so that they would otherwise 
take one off the unit simplex. This changes the update ratio for that step. See [9]. 


Iterative focusing of q with focusing function Fq(x) using qp distance 
and Brouwer updating: 


const x e E ^ lniFc] |x>) x 444 (22) 

Iterative focusing of q with focusing function Fq(x) using pq distance: 

( 23 ) 

Note that some of these update ratios are themselves proper probability distri- 
butions, e.g., the Nearest Newton update ratio. 

All of these update rules are invariant under rescaling of Fg(x ), i.e., multi- 
plication of Fg(x) by a constant (the term const changes to compensate for the 
new scale). When Fg(x) is an exponential (i.e., Boltzmann function) of G(x), 
such a transformation of Fq(x) is equivalent to adding a constant to G(x ). 
However for the other choices of Fg(x) mentioned above (e.g., Q(K — G(x))), 
there is no simple correspondence between this invariance of the update rule 
and a change to G(x). In particular, for those other Fg , rescaling of Fq(x) does 
not correspond to adding a constant to G(x). Accordingly, unlike the case for 
exponential Fg(x), those non-exponential Fq(x) do not give rise to update rules 
that are invariant under addition of a constant to G(x ). The exponential Fg(x) 
has the other nice property that rescaling G(x) is equivalent to just translating 
the exponent constant /3, i.e., to tightening Fg(x) about x with low G(x). Just 
as other Fg(x) do not have nice behavior under addition of a constant to G(x), 
they also do not have this nice character under rescaling. 

Simple modifications to non-exponential Fg typically allows them to share 
these nice characteristics. These modifications involve dynamically replacing 
constants in those Fg, in particular replacing constants that might otherwise 
be annealed according to a pre-j&xed schedule. Typically in the place of such 
constants one uses explicit functions of q. For example, we can replace @(K — 
G(x)) with Q(Eg(G) — G(x)), or with @( 7 r(e, G) — G(x)), where 7 r(e, < 7 , G) is 
the value of G(x) in the best e percentile under distribution q. For either of these 
replacements Fg(x) is invariant under both translation and scaling of G(x). 

Say one has a set of multiple objective functions / constraints {G*} rather 
than just a single, unconstrained function G. 7 Some of the update rules pre- 
sented above can be modified to simultaneously improve all of those functions. 
One does this by replacing_“FG(x)” throughout the update rule with some 
function Fjg*}^)* That function is designed so that its global minimum is a 
Pareto optimal x (according to the {G*}), and so that it cannot decrease (i.e., 
improve) in going from one x to another if any of the Fg* (x) increase in that 

7 For current purposes, we can cast constraints over x as objective functions, so that the 
expectation of those functions equals 0 iff the underlying distribution has its support restricted 
to x meeting the constraints. 


const x Eq(Fc \ x*) x 


q{?i) 

qK x i) 


transition. 4 * * * 8 A simple example, applicable for update rules using pq distance, is 
F {Gi} (x) = UMK t -G i(x)). In practice of course, the choice of the function 
F{ Gi }(x) and the update rule one is modifying will have a crucial effect on how 
prone the algorithm is to getting caught at local critical points. 

These update rules can be broadly grouped into two distinct sets based on 
what guarantees they have. Say we can evaluate every term in each update rule 
in closed form, or alternatively that our Monte Carlo estimate is exact. Then 
there is a ^-independent target distribution, p *, and a distance measure X>, such 
that each update of q in the serial Brouwer version of Eq. 18 is guaranteed not 
to increase T>(q,p*). (In this case p* is the Boltzmann distribution over values 
of Fq, and V is qp distance.) The same is true for for adaptive importance 
sampling minimization of pq distance, Eq. 19. (Again p* is the Boltzmann 
distribution in Fq.) For small enough step size, we also have this guarantee for 
gradient descent of qp distance and Nearest Newton. None of the other update 
rules have such guarantees. 

Finally, it is worth re-emphasizing that for the update rule of adaptive im- 
portance sampling minimization of pq distance, at equilibrium each q z is in- 
dependent of the distributions q (It’s just the marginal of the Boltzmann 
distribution p&.) As mentioned previously, the same property holds for the 
variants of the Monte Carlo process discussed in the section on modifying the 
Monte Carlo process for Brouwer updating discussed in [7]. This means one 
can use samples from all the previous Monte Carlo blocks with impunity. You 
don’t have to worry that the samples of those earlier blocks were formed under 
a different and therefore would lead to a different update from the one 
appropriate for the current Monte Carlo block. In addition each of these two 
algorithms have a single equilibrium, given by the Boltzmann distribution p&. 
In this, they have no local minima problems. None of the other update rules 
have such a set of guarantees. 
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